Boston, known as the "Hub of the Universe," is a city with a rich history and vibrant culture. Established in 1630, it played a significant role in the American Revolution and boasts numerous historical landmarks. Today, Boston is a thriving metropolis, home to prestigious universities like Harvard and MIT, fostering a climate of innovation and intellectual curiosity. The city's diverse neighborhoods each have their own unique charm, from the cobblestone streets of Beacon Hill to the bustling multicultural hub of Chinatown. With world-class museums, beautiful parks, passionate sports fans, and a lively arts scene, Boston offers a blend of tradition and modernity, making it a captivating destination for residents and visitors alike.
In Boston, the administrative hierarchy can be represented as follows:
Country: United States
- The highest level of administrative division, encompassing the entire country.
State: Massachusetts
- The state in which Boston is located.
County: Suffolk County
- The county in which Boston is located. Suffolk County includes the city of Boston and some neighboring areas.
City: Boston
- The city of Boston itself, which is the capital and largest city of Massachusetts.
Neighborhoods/Districts: Boston is further divided into several neighborhoods or districts, each with its own characteristics and local governance. Some of the well-known neighborhoods in Boston include:
A1: Downtown
A15: Charlestown
A7: East Boston
B2: Roxbury
B3: Mattapan
C6: South Boston
C11: Dorchester
D4: South End
D14: Brighton
E5: West Roxbury
E13: Jamaica Plain
E18: Hyde Park
These administrative divisions outline the hierarchical structure of Boston's governance and provide a framework for managing and providing services to different areas within the city.
from encodings.aliases import aliases # Python has a file containing a dictionary of encoding names and associated aliases
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import plotly.express as px
import scipy.stats as stat
import seaborn as sns
import pandas as pd
import numpy as np
import calendar
import json
%matplotlib inline
# To find encodings that work, build a set of all available encoding names and try each on a small sample of the file;
# Alternatively, open the CSV in Notepad and read the encoding from the bottom-right corner;
alias_values = set(aliases.values())
for encoding in alias_values:
    try:
        df = pd.read_csv("Miscellaneous/crime.csv", nrows = 5, encoding = encoding)
        print('successful', encoding)
    except Exception:
        pass
successful cp273 successful cp1258 successful iso8859_15 successful cp037 successful cp858 successful gbk successful iso8859_9 successful iso8859_7 successful cp775 successful cp863 successful cp1252 successful cp857 successful mac_greek successful cp855 successful cp437 successful cp949 successful mac_roman successful kz1048 successful mbcs successful ptcp154 successful cp852 successful iso8859_10 successful cp1254 successful hp_roman8 successful mac_turkish successful cp869 successful cp1251 successful cp1026 successful cp1255 successful mac_latin2 successful cp850 successful cp1256 successful cp865 successful cp861 successful cp1140 successful cp1250 successful utf_8 successful iso8859_2 successful iso8859_16 successful iso8859_11 successful gb18030 successful cp1257 successful cp1125 successful iso8859_6 successful mac_iceland successful cp862 successful cp932 successful koi8_r successful cp500 successful iso8859_14 successful iso8859_13 successful cp866 successful mac_cyrillic successful iso8859_4 successful latin_1 successful cp1253 successful cp864 successful cp860 successful iso8859_5 successful iso8859_3
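The trial-and-error loop above can be wrapped into a reusable, stdlib-only helper that probes raw bytes directly instead of re-reading the CSV for each candidate. A minimal sketch; the sample bytes below are made up for illustration, not taken from the real file:

```python
def working_encodings(raw: bytes, candidates):
    """Return the candidate encodings that decode `raw` without errors."""
    ok = []
    for enc in candidates:
        try:
            raw.decode(enc)
            ok.append(enc)
        except (UnicodeDecodeError, LookupError):
            pass
    return ok

# Toy bytes containing an en dash, which cp1252 encodes as 0x96
sample = "Boston – Suffolk County".encode("cp1252")
print(working_encodings(sample, ["ascii", "utf_8", "cp1252", "latin_1"]))
```

Note that "decodes without errors" is necessary but not sufficient: permissive codecs like latin_1 accept any byte sequence, so a decoded sample should still be eyeballed for mojibake.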
# I have used ANSI as the encoding here, since I found the exact encoding by opening the file in Notepad;
# If you don't know the exact encoding, go for the method above and use any of the encodings that worked
crime_df = pd.read_csv("Miscellaneous/crime.csv", encoding = "ANSI", low_memory = False)
crime_df.head()
| | INCIDENT_NUMBER | OFFENSE_CODE | OFFENSE_CODE_GROUP | OFFENSE_DESCRIPTION | DISTRICT | REPORTING_AREA | SHOOTING | OCCURRED_ON_DATE | YEAR | MONTH | DAY_OF_WEEK | HOUR | UCR_PART | STREET | Lat | Long | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | I192074715 | 2629 | Harassment | HARASSMENT | B2 | 278 | NaN | 2018-01-01 00:00:00 | 2018 | 1 | Monday | 0 | Part Two | HARRISON AVE | 42.331538 | -71.080157 | (42.33153805, -71.08015661) |
| 1 | I192068538 | 1107 | Fraud | FRAUD - IMPERSONATION | D14 | 794 | NaN | 2018-01-01 00:00:00 | 2018 | 1 | Monday | 0 | Part Two | GLENVILLE AVE | 42.349780 | -71.134230 | (42.34977988, -71.13423049) |
| 2 | I192005657 | 2610 | Other | TRESPASSING | C11 | 396 | NaN | 2018-01-01 00:00:00 | 2018 | 1 | Monday | 0 | Part Two | MELBOURNE ST | 42.291093 | -71.065945 | (42.29109287, -71.06594539) |
| 3 | I192075335 | 3208 | Property Lost | PROPERTY - MISSING | D4 | 132 | NaN | 2018-01-01 00:00:00 | 2018 | 1 | Monday | 0 | Part Three | COMMONWEALTH AVE | 42.353522 | -71.072838 | (42.35352153, -71.07283786) |
| 4 | I192013179 | 619 | Larceny | LARCENY ALL OTHERS | C11 | 360 | NaN | 2018-01-01 00:00:00 | 2018 | 1 | Monday | 0 | Part One | CENTERVALE PARK | 42.296323 | -71.063569 | (42.29632282, -71.06356881) |
# Creating a copy to roll back to this point in case it's needed
crime = crime_df.copy()
crime.head(1)
| | INCIDENT_NUMBER | OFFENSE_CODE | OFFENSE_CODE_GROUP | OFFENSE_DESCRIPTION | DISTRICT | REPORTING_AREA | SHOOTING | OCCURRED_ON_DATE | YEAR | MONTH | DAY_OF_WEEK | HOUR | UCR_PART | STREET | Lat | Long | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | I192074715 | 2629 | Harassment | HARASSMENT | B2 | 278 | NaN | 2018-01-01 00:00:00 | 2018 | 1 | Monday | 0 | Part Two | HARRISON AVE | 42.331538 | -71.080157 | (42.33153805, -71.08015661) |
crime.shape
(98888, 17)
crime.size # size = rows x columns, consistent with shape (98888 * 17)
1681096
crime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98888 entries, 0 to 98887
Data columns (total 17 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   INCIDENT_NUMBER      98888 non-null  object
 1   OFFENSE_CODE         98888 non-null  int64
 2   OFFENSE_CODE_GROUP   98888 non-null  object
 3   OFFENSE_DESCRIPTION  98888 non-null  object
 4   DISTRICT             98206 non-null  object
 5   REPORTING_AREA       98888 non-null  object
 6   SHOOTING             402 non-null    object
 7   OCCURRED_ON_DATE     98888 non-null  object
 8   YEAR                 98888 non-null  int64
 9   MONTH                98888 non-null  int64
 10  DAY_OF_WEEK          98888 non-null  object
 11  HOUR                 98888 non-null  int64
 12  UCR_PART             98868 non-null  object
 13  STREET               97274 non-null  object
 14  Lat                  92133 non-null  float64
 15  Long                 92133 non-null  float64
 16  Location             92133 non-null  object
dtypes: float64(2), int64(4), object(11)
memory usage: 12.8+ MB
total_memory_usage = crime.memory_usage().sum() / 1024**2
print(f"Total memory usage: {total_memory_usage:.2f} MB")
Total memory usage: 12.83 MB
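Since memory usage matters at ~99k rows, a common follow-up is converting low-cardinality object columns to the category dtype. A toy sketch of the idea; the column values below are made up to mimic columns like DISTRICT and UCR_PART, not taken from crime_df:

```python
import pandas as pd

# Toy frame standing in for the real one: repetitive string columns
df = pd.DataFrame({
    "district": ["B2", "C11", "B2", "D4"] * 25000,
    "ucr_part": ["Part One", "Part Two", "Part Three", "Part Two"] * 25000,
})
before = df.memory_usage(deep=True).sum()

# Category dtype stores each unique string once plus small integer codes
for col in df.select_dtypes("object").columns:
    df[col] = df[col].astype("category")
after = df.memory_usage(deep=True).sum()

print(f"object: {before / 1024**2:.1f} MB -> category: {after / 1024**2:.1f} MB")
```

The trade-off is that some string operations require going through the `.cat` accessor, so this is worth doing mainly for columns with few unique values relative to row count.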
# Lowercasing the column names for convenience when referring to them
print(f'Before: {crime.columns}')
crime.columns = [x.lower() for x in crime.columns]
print(f'After: {crime.columns}')
Before: Index(['INCIDENT_NUMBER', 'OFFENSE_CODE', 'OFFENSE_CODE_GROUP',
'OFFENSE_DESCRIPTION', 'DISTRICT', 'REPORTING_AREA', 'SHOOTING',
'OCCURRED_ON_DATE', 'YEAR', 'MONTH', 'DAY_OF_WEEK', 'HOUR', 'UCR_PART',
'STREET', 'Lat', 'Long', 'Location'],
dtype='object')
After: Index(['incident_number', 'offense_code', 'offense_code_group',
'offense_description', 'district', 'reporting_area', 'shooting',
'occurred_on_date', 'year', 'month', 'day_of_week', 'hour', 'ucr_part',
'street', 'lat', 'long', 'location'],
dtype='object')
# Checking for duplicated rows, if any, and removing them
print(f'Before: {crime.shape}')
print(crime.duplicated().sum())
crime.drop_duplicates(inplace=True)
print(f'After: {crime.shape}')
Before: (98888, 17)
161
After: (98727, 17)
# Changing the dtype from object to datetime for the occurred on date column
print(f"Before: {crime['occurred_on_date'].dtype}")
crime['occurred_on_date'] = pd.to_datetime(crime['occurred_on_date'])
print(f"After: {crime['occurred_on_date'].dtype}")
Before: object
After: datetime64[ns]
# Changing the numeric value of months to month names to make it easier to understand plots
crime['month'] = crime['month'].apply(lambda x: calendar.month_name[x])
# Feature Engineering
# To understand the pattern of seasonal crimes
seasons = {'January': 'Winter', 'February': 'Winter', 'March': 'Winter',
           'April': 'Spring', 'May': 'Spring', 'June': 'Spring',
           'July': 'Summer', 'August': 'Summer', 'September': 'Summer',
           'October': 'Fall', 'November': 'Fall', 'December': 'Fall'}
season_index = crime.columns.get_loc('month')
crime.insert(season_index + 1, 'season', crime['month'].map(seasons))
# Feature Engineering
# For better understandability of plots, breaking down district codes into district names
dist_names = {
    'A1': 'Downtown',
    'A15': 'Charlestown',
    'A7': 'East Boston',
    'B2': 'Roxbury',
    'B3': 'Mattapan',
    'C6': 'South Boston',
    'C11': 'Dorchester',
    'D4': 'South End',
    'D14': 'Brighton',
    'E5': 'West Roxbury',
    'E13': 'Jamaica Plain',
    'E18': 'Hyde Park'
}
crime['district'] = crime['district'].replace(dist_names)
crime.head(1)
| | incident_number | offense_code | offense_code_group | offense_description | district | reporting_area | shooting | occurred_on_date | year | month | season | day_of_week | hour | ucr_part | street | lat | long | location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | I192074715 | 2629 | Harassment | HARASSMENT | Roxbury | 278 | NaN | 2018-01-01 | 2018 | January | Winter | Monday | 0 | Part Two | HARRISON AVE | 42.331538 | -71.080157 | (42.33153805, -71.08015661) |
# Outlier check on date
print(min(crime['occurred_on_date']), max(crime['occurred_on_date']), sep = 2 * '\n')
# All dates fall within the expected 2018 range;
2018-01-01 00:00:00

2018-12-31 23:45:00
# Extracting just the date from the datetime (the hour column already exists) and dropping the occurred_on_date column
dt_index = crime.columns.get_loc('occurred_on_date')
crime.insert(dt_index + 1, 'date', crime['occurred_on_date'].dt.date)
crime.drop(columns = 'occurred_on_date', inplace = True)
crime.head(1)
| | incident_number | offense_code | offense_code_group | offense_description | district | reporting_area | shooting | date | year | month | season | day_of_week | hour | ucr_part | street | lat | long | location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | I192074715 | 2629 | Harassment | HARASSMENT | Roxbury | 278 | NaN | 2018-01-01 | 2018 | January | Winter | Monday | 0 | Part Two | HARRISON AVE | 42.331538 | -71.080157 | (42.33153805, -71.08015661) |
# Checking for number of unique values in each column
crime.nunique()
incident_number        86734
offense_code             184
offense_code_group        61
offense_description      185
district                  12
reporting_area           877
shooting                   1
date                     365
year                       1
month                     12
season                     4
day_of_week                7
hour                      24
ucr_part                   4
street                  3579
lat                    13054
long                   13055
location               13062
dtype: int64
# Defining function to check for columns with missing values
def null_cols(df):
    missing = df.columns[df.isnull().sum() != 0]
    return missing
# Checking for the number of missing values in each column
num_of_null_vals = crime[null_cols(crime)].isnull().sum()
num_of_null_vals
district      682
shooting    98408
ucr_part       20
street       1609
lat          6747
long         6747
location     6747
dtype: int64
num_of_null_vals / len(crime) * 100
district     0.690794
shooting    99.676887
ucr_part     0.020258
street       1.629747
lat          6.833997
long         6.833997
location     6.833997
dtype: float64
Column Dropping:
incident_number won't add anything to my analysis, in my opinion;
As we have offense_code_group, I don't think offense_code would help me out in any way;
The offense_description column is likewise unnecessary, as we won't be using it anywhere in our analysis;
year can go as well: this is the 2018 Boston crime dataset, and as there are no multiple years to compare, we can drop it;
location - as we have the lat and long columns separately, I am dropping this;
Since the shooting column is almost 100% null, we have no option other than dropping it altogether;
Row Dropping:
district, street, latitude and longitude - all of these columns give essential information about the location of crimes, without which the entire row is of little use wherever we analyse crime locations. As the affected rows are a negligible proportion of the data, we can drop them altogether to keep the data clean.
These rows can't be filled in with estimated values, and trying to fill them in with precise values would be a tedious job.
The ucr_part column refers to the Uniform Crime Reporting offense types. The UCR classification system divides offenses into four categories: Part One, Part Two, Part Three, and Other. Part One offenses are considered the most severe and include crimes such as larceny/robbery, assault, and breaking & entering. Most of the crimes in Boston in 2018 were classified as UCR Part Three, while 'Other' was the smallest category, with a proportion of just 0.4%.
ucr_part also has null values in a negligible proportion of the dataset, so we can drop those rows. Again, these could be filled in with precise values, but that would be a tedious job well outside the context of this analysis;
Note 1: While dropping rows with null values, first drop without the inplace argument and check the shape of the data. If it looks fine, then proceed with inplace.
Note 2: This decision is subjective; it may differ from one analyst's perspective to another and with the context of the analysis.
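The subjective drop/keep calls above can also be mechanized as a null-share threshold. A sketch on a toy frame; the 95% cutoff is my own assumption, not a rule from this notebook:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the null pattern of the real data
df = pd.DataFrame({
    "shooting": [np.nan] * 99 + ["Y"],       # ~99% null, like the real shooting column
    "district": ["B2"] * 98 + [np.nan] * 2,  # ~2% null, worth keeping
})

null_share = df.isnull().mean()              # fraction of nulls per column
mostly_null = null_share[null_share > 0.95].index
df = df.drop(columns=mostly_null)
print(list(df.columns))
```

A threshold like this drops columns that are nearly all null while leaving sparsely-null columns for row-level `dropna`, matching the column/row split argued above.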
print(f'Before: {crime.shape}')
crime.drop(columns = ['incident_number', 'offense_code', 'offense_description', 'year', 'location', 'shooting'], inplace = True)
print(f'After: {crime.shape}')
Before: (98727, 18)
After: (98727, 12)
null_cols(crime)
Index(['district', 'ucr_part', 'street', 'lat', 'long'], dtype='object')
crime.dropna(axis = 0, ignore_index = True) # Haven't put inplace = True; Just checking dimensions
| | offense_code_group | district | reporting_area | date | month | season | day_of_week | hour | ucr_part | street | lat | long |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Harassment | Roxbury | 278 | 2018-01-01 | January | Winter | Monday | 0 | Part Two | HARRISON AVE | 42.331538 | -71.080157 |
| 1 | Fraud | Brighton | 794 | 2018-01-01 | January | Winter | Monday | 0 | Part Two | GLENVILLE AVE | 42.349780 | -71.134230 |
| 2 | Other | Dorchester | 396 | 2018-01-01 | January | Winter | Monday | 0 | Part Two | MELBOURNE ST | 42.291093 | -71.065945 |
| 3 | Property Lost | South End | 132 | 2018-01-01 | January | Winter | Monday | 0 | Part Three | COMMONWEALTH AVE | 42.353522 | -71.072838 |
| 4 | Larceny | Dorchester | 360 | 2018-01-01 | January | Winter | Monday | 0 | Part One | CENTERVALE PARK | 42.296323 | -71.063569 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 91316 | Medical Assistance | Hyde Park | 555 | 2018-12-31 | December | Fall | Monday | 23 | Part Three | POPLAR ST | 42.277147 | -71.125124 |
| 91317 | Vandalism | West Roxbury | 564 | 2018-12-31 | December | Fall | Monday | 23 | Part Two | WASHINGTON ST | 42.294217 | -71.119853 |
| 91318 | Medical Assistance | Brighton | 773 | 2018-12-31 | December | Fall | Monday | 23 | Part Three | KIRKWOOD RD | 42.341269 | -71.157506 |
| 91319 | Investigate Property | South End | 620 | 2018-12-31 | December | Fall | Monday | 23 | Part Three | BOYLSTON ST | 42.347102 | -71.088417 |
| 91320 | Vandalism | Roxbury | 318 | 2018-12-31 | December | Fall | Monday | 23 | Part Two | BROOKLEDGE ST | 42.308412 | -71.088381 |
91321 rows × 12 columns
print(f'Before: {crime.shape}')
crime.dropna(axis = 0, inplace = True, ignore_index = True)
print(f'After: {crime.shape}')
Before: (98727, 12)
After: (91321, 12)
null_cols(crime)
Index([], dtype='object')
crime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91321 entries, 0 to 91320
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   offense_code_group  91321 non-null  object
 1   district            91321 non-null  object
 2   reporting_area      91321 non-null  object
 3   date                91321 non-null  object
 4   month               91321 non-null  object
 5   season              91321 non-null  object
 6   day_of_week         91321 non-null  object
 7   hour                91321 non-null  int64
 8   ucr_part            91321 non-null  object
 9   street              91321 non-null  object
 10  lat                 91321 non-null  float64
 11  long                91321 non-null  float64
dtypes: float64(2), int64(1), object(9)
memory usage: 8.4+ MB
crime.columns
Index(['offense_code_group', 'district', 'reporting_area', 'date', 'month',
'season', 'day_of_week', 'hour', 'ucr_part', 'street', 'lat', 'long'],
dtype='object')
crime.describe(include='object') # Numeric summary stats hold little value here, so describing only the object columns;
| | offense_code_group | district | reporting_area | date | month | season | day_of_week | ucr_part | street |
|---|---|---|---|---|---|---|---|---|---|
| count | 91321 | 91321 | 91321 | 91321 | 91321 | 91321 | 91321 | 91321 | 91321 |
| unique | 59 | 12 | 876 | 365 | 12 | 4 | 7 | 4 | 3241 |
| top | Motor Vehicle Accident Response | Roxbury | 111 | 2018-06-15 | May | Spring | Friday | Part Three | WASHINGTON ST |
| freq | 9276 | 14412 | 692 | 342 | 8340 | 24101 | 13819 | 46698 | 4492 |
offense_prop = (crime['offense_code_group'].value_counts(
    normalize = True).head(15) * 100).to_frame().reset_index().rename(
    columns = {'offense_code_group': 'offense'})
g = sns.FacetGrid(data = offense_prop, height = 9, aspect = 1.2)
g.map(sns.barplot, 'proportion', 'offense', order = offense_prop['offense'], palette = 'Reds_r', orient = 'h')
g.set(ylabel = '', xlabel = 'Percentage')
# Removing left yticks and labels and placing it over the right
g.despine(left = True)
plt.tick_params(axis = 'y', which = 'both', left = False) # both means both major and minor ticks
plt.gca().yaxis.set_label_position('right')
plt.gca().yaxis.tick_right()
plt.show()
offense_prop['proportion'].sum()
77.21006121264551
Of the total 59 offense types, the top 15 (roughly a quarter of the types) account for more than three quarters of the crimes that occurred
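This "few categories dominate" observation generalizes to a cumulative-share (Pareto) check. A toy illustration with made-up counts, not the real offense data:

```python
import pandas as pd

# Toy offense counts, sorted descending as value_counts() would return them
counts = pd.Series([50, 30, 10, 5, 3, 2], index=list("abcdef"))

# Running percentage of total covered by the top-k categories
cumulative = (counts / counts.sum() * 100).cumsum()
print(f"Top 2 of {len(counts)} categories cover {cumulative.iloc[1]:.0f}% of incidents")
```

Applying the same `cumsum` to the real `value_counts(normalize=True)` would show exactly where the 75% line is crossed instead of picking "top 15" by hand.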
values = crime['ucr_part'].value_counts()
labels = values.index  # take labels from value_counts itself so each label matches its count
fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=0.6, textinfo='label+percent')])
fig.update_layout(
title='Crime proportion under each UCR part',
showlegend=False, height=600
)
fig.show()
The above information, once aggregated from every part of the nation, feeds into legislation by the U.S. Congress. As I described a few cells above, Part One crimes are the most serious; although Part Three incidents make up the majority here, the substantial share of Part One offenses looks like an area that needs more attention from legislative bodies and law enforcement agencies;
with open('Miscellaneous/map.geojson') as f:  # avoid shadowing the built-in map
    geojson = json.load(f)
vals = crime['district'].value_counts().reset_index()
merged_data = pd.merge(crime, vals, on='district', how='left')
fig = px.choropleth(merged_data,locations='district', geojson=geojson, featureidkey='properties.name', color='count', color_continuous_scale='Reds')
fig.update_geos(fitbounds='locations', visible=False)
fig.update_layout(title='Crime distribution pattern across districts', coloraxis_colorbar=dict(title='Count'))
fig.update_traces(hovertemplate='<b>District</b>: %{location}<br><b>Number of Crimes</b>: %{z}<extra></extra>')
fig.show()
fig = px.scatter_mapbox(
    merged_data,
    lat='lat',
    lon='long',
    zoom=10,
    mapbox_style='carto-positron',
    hover_name='district',
    hover_data=['date', 'offense_code_group', 'reporting_area', 'street']
)
fig.update_layout(
title='Crime distribution pattern across locations'
)
fig.update_traces(
hovertemplate='<b>District</b>: %{hovertext}<br>'
'<b>Date</b>: %{customdata[0]}<br>'
'<b>Offense Code Group</b>: %{customdata[1]}<br>'
'<b>Reporting Area</b>: %{customdata[2]}<br>'
'<b>Street</b>: %{customdata[3]}<extra></extra>'
)
fig.show()
crime_by_month = crime['month'].value_counts().sort_index().reset_index()
crime_by_month = crime_by_month.sort_values('count', ascending=True)
average_crime = crime_by_month['count'].mean()
fig = px.bar(crime_by_month, y='month', x='count', orientation='h', color='count',
color_continuous_scale='viridis_r')
fig.add_shape(type="line",
x0=average_crime, y0=-0.5,
x1=average_crime, y1=len(crime_by_month) - 0.5,
line=dict(color="red", width=2, dash="dash"))
fig.add_annotation(x=average_crime, y=len(crime_by_month) - 0.5,
text=f'Average: {average_crime:.0f}',
showarrow=True, arrowhead=1, ax=-50, ay=-40)
fig.update_layout(
title='Crime Count by Month',
xaxis_title='Number of Crimes',
yaxis_title='Month',
xaxis=dict(showgrid=False),
yaxis=dict(showgrid=False),
coloraxis_colorbar=dict(title='Count')
)
fig.show()
As the plot above hasn't given much information, I am going for the plot below
grouped_data = merged_data.groupby(['offense_code_group', 'month'])['count'].size().reset_index()
grouped_data.rename(
{'offense_code_group': 'Offense Code Group', 'month': 'Month', 'count': 'Count'},
axis=1, inplace=True
)
fig = px.bar(
grouped_data, x='Month', y='Count',
animation_frame='Offense Code Group',
range_y=[0, grouped_data['Count'].max() + 100]
)
fig.update_layout(title='Monthly Crime Count by Offense Code Group',
xaxis_title='Month',
yaxis_title='Crime Count')
fig.layout.updatemenus[0].buttons[0].args[1]['frame']['duration'] = 1500
fig.show()
crime_by_season = crime['season'].value_counts().reset_index()
crime_by_season.columns = ['Season', 'Count']
fig = px.pie(crime_by_season, values='Count', names='Season', hole=0.6)
fig.update_layout(
title='Crime Proportion by Season'
)
fig.show()
The plot above shows that the overall crime count is fairly evenly distributed across seasons, so I'm breaking it down further below
grouped_data = merged_data.groupby(['offense_code_group', 'season'])['count'].size().reset_index()
grouped_data.rename(
{'offense_code_group': 'Offense Code Group', 'season': 'Season', 'count': 'Count'},
axis = 1, inplace = True
)
fig = px.bar(
grouped_data, x='Season', y='Count',
animation_frame='Offense Code Group',
range_y=[0, grouped_data['Count'].max() + 100]
)
fig.update_layout(title='Seasonal Crime Count by Offense Code Group',
xaxis_title='Season',
yaxis_title='Crime Count')
fig.layout.updatemenus[0].buttons[0].args[1]['frame']['duration'] = 1500
fig.show()
crime_by_district = merged_data.groupby('district')['count'].mean().reset_index()
crime_by_district = crime_by_district.sort_values('count', ascending=True)
overall_average_crime = crime_by_district['count'].mean()
fig = px.bar(crime_by_district, y='district', x='count', color='count',
color_continuous_scale='viridis_r', orientation='h')
fig.add_shape(type="line",
x0=overall_average_crime, y0=-0.5,
x1=overall_average_crime, y1=len(crime_by_district) - 0.5,
line=dict(color="red", width=2, dash="dash"))
fig.add_annotation(x=overall_average_crime, y=len(crime_by_district) - 0.5,
text=f'Average: {overall_average_crime:.0f}',
showarrow=True, arrowhead=1, ax=-50, ay=-40)
fig.update_layout(
title='Crime Count by District',
xaxis_title='Number of Crimes',
yaxis_title='District',
xaxis=dict(showgrid=False),
yaxis=dict(showgrid=False),
coloraxis_colorbar=dict(title='Count')
)
fig.show()
crime_by_day = crime['day_of_week'].value_counts().sort_index().reset_index()
crime_by_day.columns = ['Day of Week', 'Count']
crime_by_day = crime_by_day.sort_values('Count', ascending=False)
fig = px.bar(crime_by_day, x='Day of Week', y='Count', color='Count',
color_continuous_scale='cividis_r')
fig.update_layout(
title='Crime Count by Day of Week',
xaxis_title='Day of Week',
yaxis_title='Number of Crimes'
)
fig.show()
crime_by_hour = crime['hour'].value_counts().sort_index().reset_index()
crime_by_hour.columns = ['Hour', 'Count']
fig = px.bar(crime_by_hour, x='Hour', y='Count', animation_frame='Hour',
range_x=[0, 23], range_y=[0, crime_by_hour['Count'].max()],
labels={'Hour': 'Hour of the Day', 'Count': 'Number of Crimes'})
fig.update_layout(
title='Crime Count by Hour of the Day',
xaxis_title='Hour of the Day',
yaxis_title='Number of Crimes',
yaxis=dict(title_standoff=0)
)
fig.layout.updatemenus[0].buttons[0].args[1]['frame']['duration'] = 1500
fig.show()
fig, ax = plt.subplots(figsize=(10, 6))
week_and_hour = crime.groupby(['hour', 'day_of_week']).count()['offense_code_group'].unstack()
# unstack() sorts columns alphabetically; reindex (not rename) to get true weekday order
week_and_hour = week_and_hour.reindex(columns=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
heatmap = sns.heatmap(week_and_hour, cmap=sns.cubehelix_palette(as_cmap=True), ax=ax)
heatmap.set_yticklabels(heatmap.get_yticklabels(), rotation=0)
plt.xlabel('')
plt.ylabel('Hour')
plt.show()
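A caveat behind heatmaps built this way: `unstack()` emits its columns in sorted (alphabetical) order, so weekday columns must be reindexed, not simply renamed, or the labels silently point at the wrong data. A toy illustration:

```python
import pandas as pd

# Tiny hour-by-day counts, as a groupby would produce
s = pd.Series(
    [5, 7, 3],
    index=pd.MultiIndex.from_tuples(
        [(0, "Monday"), (0, "Friday"), (0, "Sunday")], names=["hour", "day"]
    ),
)

wide = s.unstack()
print(list(wide.columns))  # alphabetical, not chronological

# reindex moves the data along with the labels
wide = wide.reindex(columns=["Monday", "Friday", "Sunday"])
print(list(wide.columns))
```

Assigning `wide.columns = [...]` instead would keep the alphabetical data in place and merely relabel it, which is exactly the bug to avoid.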
offense_counts = crime['offense_code_group'].value_counts().to_frame()
off_unique = crime.groupby('offense_code_group').nunique()
off_unique.insert(0, 'count', offense_counts.iloc[:, 0])
off_unique = off_unique.sort_values('count', ascending=False)
off_unique.columns = [column.replace('_', ' ').title() for column in off_unique.columns]
off_unique.rename_axis('Offense Code Group', axis='index', inplace=True)
off_unique.head(len(off_unique))
| | Count | District | Reporting Area | Date | Month | Season | Day Of Week | Hour | Ucr Part | Street | Lat | Long |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Offense Code Group | ||||||||||||
| Motor Vehicle Accident Response | 9276 | 12 | 830 | 365 | 12 | 4 | 7 | 24 | 1 | 1665 | 5385 | 5382 |
| Larceny | 7831 | 12 | 747 | 365 | 12 | 4 | 7 | 24 | 1 | 1179 | 2850 | 2850 |
| Medical Assistance | 7822 | 12 | 817 | 365 | 12 | 4 | 7 | 24 | 1 | 1697 | 3954 | 3953 |
| Other | 5254 | 12 | 760 | 365 | 12 | 4 | 7 | 24 | 3 | 1302 | 2890 | 2889 |
| Investigate Person | 5238 | 12 | 776 | 365 | 12 | 4 | 7 | 24 | 1 | 1433 | 3099 | 3099 |
| Simple Assault | 4953 | 12 | 719 | 365 | 12 | 4 | 7 | 24 | 1 | 1121 | 2610 | 2610 |
| Verbal Disputes | 4362 | 12 | 649 | 365 | 12 | 4 | 7 | 24 | 1 | 1177 | 2261 | 2261 |
| Drug Violation | 4111 | 12 | 563 | 362 | 12 | 4 | 7 | 24 | 1 | 709 | 1514 | 1514 |
| Vandalism | 4080 | 12 | 737 | 365 | 12 | 4 | 7 | 24 | 1 | 1328 | 2876 | 2875 |
| Investigate Property | 3575 | 12 | 719 | 365 | 12 | 4 | 7 | 24 | 1 | 1143 | 2282 | 2283 |
| Towed | 3488 | 12 | 657 | 363 | 12 | 4 | 7 | 24 | 1 | 1085 | 2305 | 2305 |
| Property Lost | 3341 | 12 | 667 | 365 | 12 | 4 | 7 | 24 | 1 | 876 | 1939 | 1940 |
| Larceny From Motor Vehicle | 2908 | 12 | 682 | 364 | 12 | 4 | 7 | 24 | 1 | 1113 | 2215 | 2215 |
| Aggravated Assault | 2254 | 12 | 556 | 363 | 12 | 4 | 7 | 24 | 1 | 740 | 1490 | 1490 |
| Fraud | 2016 | 12 | 655 | 358 | 12 | 4 | 7 | 24 | 1 | 894 | 1549 | 1550 |
| Warrant Arrests | 1990 | 12 | 498 | 363 | 12 | 4 | 7 | 24 | 1 | 597 | 1186 | 1187 |
| Missing Person Located | 1695 | 12 | 450 | 362 | 12 | 4 | 7 | 24 | 1 | 609 | 914 | 914 |
| Violations | 1359 | 12 | 479 | 348 | 12 | 4 | 7 | 24 | 1 | 488 | 1035 | 1035 |
| Residential Burglary | 1297 | 12 | 471 | 348 | 12 | 4 | 7 | 24 | 1 | 684 | 1016 | 1016 |
| Harassment | 1287 | 12 | 541 | 347 | 12 | 4 | 7 | 24 | 1 | 648 | 1004 | 1004 |
| Auto Theft | 1240 | 12 | 501 | 345 | 12 | 4 | 7 | 24 | 1 | 609 | 1038 | 1038 |
| Property Found | 1187 | 12 | 448 | 338 | 12 | 4 | 7 | 24 | 1 | 437 | 763 | 762 |
| Robbery | 1076 | 12 | 403 | 345 | 12 | 4 | 7 | 24 | 1 | 407 | 831 | 831 |
| Police Service Incidents | 1036 | 12 | 415 | 333 | 12 | 4 | 7 | 24 | 1 | 445 | 718 | 718 |
| Missing Person Reported | 883 | 12 | 312 | 326 | 12 | 4 | 7 | 24 | 2 | 399 | 540 | 540 |
| Confidence Games | 842 | 12 | 420 | 315 | 12 | 4 | 7 | 24 | 1 | 421 | 686 | 686 |
| Disorderly Conduct | 574 | 12 | 262 | 278 | 12 | 4 | 7 | 24 | 1 | 236 | 407 | 407 |
| Fire Related Reports | 546 | 12 | 354 | 282 | 12 | 4 | 7 | 24 | 2 | 348 | 500 | 500 |
| License Violation | 514 | 12 | 194 | 213 | 12 | 4 | 7 | 20 | 1 | 143 | 315 | 315 |
| Restraining Order Violations | 469 | 12 | 233 | 261 | 12 | 4 | 7 | 24 | 1 | 268 | 338 | 338 |
| Firearm Violations | 460 | 12 | 251 | 246 | 12 | 4 | 7 | 24 | 1 | 243 | 361 | 361 |
| Counterfeiting | 404 | 12 | 269 | 244 | 12 | 4 | 7 | 24 | 1 | 207 | 335 | 336 |
| Recovered Stolen Property | 385 | 12 | 248 | 232 | 12 | 4 | 7 | 24 | 1 | 242 | 347 | 347 |
| Landlord/Tenant Disputes | 348 | 12 | 180 | 218 | 12 | 4 | 7 | 24 | 1 | 212 | 250 | 250 |
| Liquor Violation | 320 | 11 | 97 | 172 | 12 | 4 | 7 | 21 | 1 | 83 | 155 | 155 |
| Auto Theft Recovery | 320 | 12 | 218 | 210 | 12 | 4 | 7 | 23 | 1 | 226 | 288 | 288 |
| Commercial Burglary | 313 | 12 | 190 | 198 | 12 | 4 | 7 | 24 | 1 | 133 | 247 | 247 |
| Property Related Damage | 263 | 12 | 216 | 161 | 12 | 4 | 7 | 24 | 1 | 198 | 253 | 253 |
| Ballistics | 259 | 12 | 162 | 185 | 12 | 4 | 7 | 24 | 1 | 199 | 239 | 239 |
| Search Warrants | 243 | 12 | 142 | 140 | 12 | 4 | 7 | 24 | 1 | 145 | 179 | 179 |
| Assembly or Gathering Violations | 184 | 12 | 97 | 129 | 12 | 4 | 7 | 24 | 1 | 104 | 140 | 140 |
| License Plate Related Incidents | 182 | 12 | 148 | 139 | 12 | 4 | 7 | 21 | 2 | 150 | 167 | 167 |
| Firearm Discovery | 177 | 12 | 128 | 139 | 12 | 4 | 7 | 23 | 1 | 133 | 148 | 148 |
| Offenses Against Child / Family | 124 | 12 | 94 | 96 | 12 | 4 | 7 | 22 | 1 | 95 | 112 | 112 |
| Other Burglary | 123 | 12 | 100 | 103 | 12 | 4 | 7 | 24 | 1 | 89 | 111 | 111 |
| Operating Under the Influence | 113 | 12 | 98 | 97 | 12 | 4 | 7 | 19 | 1 | 91 | 112 | 112 |
| Evading Fare | 108 | 12 | 93 | 94 | 12 | 4 | 7 | 24 | 1 | 77 | 102 | 102 |
| Embezzlement | 93 | 11 | 73 | 82 | 12 | 4 | 7 | 17 | 1 | 58 | 81 | 81 |
| Prisoner Related Incidents | 84 | 10 | 66 | 75 | 12 | 4 | 7 | 22 | 2 | 58 | 71 | 71 |
| Service | 71 | 12 | 68 | 65 | 12 | 4 | 7 | 18 | 1 | 62 | 70 | 70 |
| Homicide | 50 | 10 | 45 | 47 | 12 | 4 | 7 | 19 | 1 | 47 | 50 | 50 |
| Criminal Harassment | 41 | 10 | 37 | 37 | 11 | 4 | 7 | 18 | 1 | 34 | 37 | 37 |
| Bomb Hoax | 33 | 9 | 31 | 18 | 12 | 4 | 7 | 13 | 1 | 28 | 32 | 32 |
| Harbor Related Incidents | 33 | 6 | 11 | 31 | 11 | 4 | 7 | 14 | 1 | 16 | 17 | 17 |
| Prostitution | 30 | 4 | 15 | 19 | 9 | 4 | 6 | 11 | 1 | 16 | 20 | 20 |
| Phone Call Complaints | 19 | 7 | 18 | 19 | 11 | 4 | 7 | 14 | 1 | 19 | 19 | 19 |
| Arson | 17 | 6 | 16 | 17 | 9 | 4 | 7 | 16 | 1 | 15 | 16 | 16 |
| Aircraft | 14 | 1 | 1 | 13 | 7 | 4 | 7 | 10 | 1 | 2 | 2 | 2 |
| Explosives | 6 | 5 | 6 | 6 | 3 | 2 | 5 | 5 | 2 | 6 | 6 | 6 |